set.seed(022)

parameter <- 2.2
lambda <- 1 / parameter

n_values <- c(300, 1e3, 1e4)
alphas <- c(.1, .05, .01)

n <- n_values[1]
exp_small <- matrix(rexp(n^2, lambda), nrow = n)
croissant_small <- matrix(rpois(n^2, parameter), nrow = n)

n <- n_values[2]
exp_medium <- matrix(rexp(n^2, lambda), nrow = n)
croissant_medium <- matrix(rpois(n^2, parameter), nrow = n)

n <- n_values[3]
exp_large <- matrix(rexp(n^2, lambda), nrow = n)
croissant_large <- matrix(rpois(n^2, parameter), nrow = n)

all_exp_samples <- list(exp_small, exp_medium, exp_large)
all_croissant_samples <- list(croissant_small, croissant_medium, croissant_large)

get_stats <- function(sample_means, alpha, left_border, right_border) {
  y <- sample_means
  den <- density(y)
  plot(den, xlab = "Mean value")

  l <- min(which(den$x >= left_border))
  h <- max(which(den$x < right_border))

  polygon(c(den$x[c(l, l:h, h)]),
          c(0, den$y[l:h], 0),
          col = "slateblue1")

  cat("For confidence level", (1 - alpha) * 100, "% CI is [", left_border, ",", right_border, "]", "\n")
  cat("with the length", right_border - left_border, "for the sample with size", length(sample_means),"\n")
  cat("Fraction of means being in that interval: ", length(sample_means[sample_means >= left_border & sample_means <= right_border]) / n, "\n\n")

}

Problem 3

Chi-squared Distribution

p.d.f. of sum of exponential distributions is a Gamma distribution: \(f(x)=\frac{x^{n-1}e^{-\lambda \cdot x} \lambda^{n}}{Г(n)}\)

Then using this article: https://search.r-project.org/CRAN/refmans/EnvStats/html/eexp.html we see that Gamma distribution is half of Chi-squared distribution with twice as many degrees of freedom.

Plug in \(n = k/2\) and \(\lambda = 1/2\): \(f(x) = \frac{x^{k/2-1}e^{-1/2 \cdot x} \cdot \frac{1}{2}^n}{Г(k/2)}\)

A confidence interval turns to \([ \frac{\chi^2(2n, \alpha/2)}{2n\hat{x}} , \frac{\chi^2(2n, 1 - \alpha/2)}{2n\hat{x}} ]\)

for (sample in all_exp_samples){
  n <- sqrt(length(sample))
  # calculating means of n different realizations of size n
  exp_cols_means <- colSums(sample) / n
  exp_mean <- mean(exp_cols_means)
  for (alpha in alphas) {
    left_border <- exp_mean * qchisq(alpha / 2, 2 * n) / (2 * n)
    right_border <- exp_mean * qchisq(1 - alpha / 2, 2 * n) / (2 * n)
    get_stats(exp_cols_means, alpha, left_border, right_border)
}
}

## For confidence level 90 % CI is [ 1.983879 , 2.399166 ] 
## with the length 0.4152865 for the sample with size 300 
## Fraction of means being in that interval:  0.91

## For confidence level 95 % CI is [ 1.946834 , 2.441731 ] 
## with the length 0.4948965 for the sample with size 300 
## Fraction of means being in that interval:  0.94

## For confidence level 99 % CI is [ 1.875782 , 2.526355 ] 
## with the length 0.6505727 for the sample with size 300 
## Fraction of means being in that interval:  0.9866667

## For confidence level 90 % CI is [ 2.086841 , 2.315681 ] 
## with the length 0.2288391 for the sample with size 1000 
## Fraction of means being in that interval:  0.909

## For confidence level 95 % CI is [ 2.06575 , 2.338437 ] 
## with the length 0.2726872 for the sample with size 1000 
## Fraction of means being in that interval:  0.96

## For confidence level 99 % CI is [ 2.024942 , 2.383342 ] 
## with the length 0.3583995 for the sample with size 1000 
## Fraction of means being in that interval:  0.996

## For confidence level 90 % CI is [ 2.164043 , 2.236419 ] 
## with the length 0.07237617 for the sample with size 10000 
## Fraction of means being in that interval:  0.8999

## For confidence level 95 % CI is [ 2.157193 , 2.243435 ] 
## with the length 0.0862418 for the sample with size 10000 
## Fraction of means being in that interval:  0.9484

## For confidence level 99 % CI is [ 2.143848 , 2.25719 ] 
## with the length 0.1133418 for the sample with size 10000 
## Fraction of means being in that interval:  0.9889

We see that as confidence level increases, the interval becomes wider. This is because one has less room for a mistake as he needs more accuracy – so he has to cover more values.

Normal approximation

Parameters of normal distribution: \(\mu = \theta\), \(\sigma^2 = s^2 / n\), where \(s^2 = \theta^2\) is the population variance (\(Var(X) = \frac{1}{\lambda^2}, X: E(\lambda)\)) \(Z = \sqrt{n}(\bar{X} - \theta) / \theta\) is approximately standard normal \(: N(0, 1)\): \(P(|\theta - \hat{X}| < z_{\beta}\cdot \theta / \sqrt{n}) = P(|Z| \le z_{\beta}) = 2\beta - 1\)

for (sample in all_exp_samples){

  n <- sqrt(length(sample))
  exp_cols_means <- colSums(sample) / n
  exp_mean <- mean(exp_cols_means)
  exp_sd_sample <- sd(sample)

  for (alpha in alphas) {
    beta <- (2 - alpha) / 2
    deviation <- (qnorm(beta) * exp_sd_sample / sqrt(n))
    left_border <- exp_mean - deviation
    right_border <- exp_mean + deviation
    get_stats(exp_cols_means, alpha, left_border, right_border)
    }
}

## For confidence level 90 % CI is [ 1.979964 , 2.394792 ] 
## with the length 0.4148273 for the sample with size 300 
## Fraction of means being in that interval:  0.9

## For confidence level 95 % CI is [ 1.940229 , 2.434527 ] 
## with the length 0.4942973 for the sample with size 300 
## Fraction of means being in that interval:  0.9433333

## For confidence level 99 % CI is [ 1.86257 , 2.512186 ] 
## with the length 0.6496167 for the sample with size 300 
## Fraction of means being in that interval:  0.99

## For confidence level 90 % CI is [ 2.085646 , 2.314375 ] 
## with the length 0.2287285 for the sample with size 1000 
## Fraction of means being in that interval:  0.91

## For confidence level 95 % CI is [ 2.063737 , 2.336284 ] 
## with the length 0.2725468 for the sample with size 1000 
## Fraction of means being in that interval:  0.959

## For confidence level 99 % CI is [ 2.020917 , 2.379104 ] 
## with the length 0.3581872 for the sample with size 1000 
## Fraction of means being in that interval:  0.996

## For confidence level 90 % CI is [ 2.163916 , 2.236296 ] 
## with the length 0.07238004 for the sample with size 10000 
## Fraction of means being in that interval:  0.8997

## For confidence level 95 % CI is [ 2.156983 , 2.243229 ] 
## with the length 0.08624613 for the sample with size 10000 
## Fraction of means being in that interval:  0.9479

## For confidence level 99 % CI is [ 2.143432 , 2.256779 ] 
## with the length 0.1133466 for the sample with size 10000 
## Fraction of means being in that interval:  0.9887

Normal approximation independent of parameter \(\theta\)

\(|\theta - \bar{X}| \le z_{\beta}\theta/\sqrt{n}\)

\(z\_{\beta}\theta/\sqrt{n}\le \theta - \bar{X} \le z\_{\beta}\theta/\sqrt{n}\)

\(z\_{\beta}/\sqrt{n}\le 1 - \bar{X}/\theta \le z\_{\beta}/\sqrt{n}\)

\(z\_{\beta}/\sqrt{n} - 1\le - \bar{X}/\theta \le z\_{\beta}/\sqrt{n} -1\)

\(\frac{z_{\beta} + \sqrt{n}}{\sqrt{n} \bar{X}}\le - 1/\theta \le \frac{z_{\beta} - \sqrt{n}}{\sqrt{n} \bar{X}}\)

\(\frac{\sqrt{n} \bar{X}}{ \sqrt{n} + z_{\beta}} \le \theta \le \frac{\sqrt{n} \bar{X}}{ \sqrt{n} - z_{\beta}}\)

for (sample in all_exp_samples){

  n <- sqrt(length(sample))
  exp_cols_means <- colSums(sample) / n
  exp_mean <- mean(exp_cols_means)

  for (alpha in alphas) {
    beta <- (2 - alpha) / 2
    numerator <- sqrt(n) * exp_mean
    left_border <- numerator / (sqrt(n) + qnorm(beta))
    right_border <- numerator / (sqrt(n) - qnorm(beta))
    get_stats(exp_cols_means, alpha, left_border, right_border)
  }
}

## For confidence level 90 % CI is [ 1.997668 , 2.416901 ] 
## with the length 0.4192325 for the sample with size 300 
## Fraction of means being in that interval:  0.8933333

## For confidence level 95 % CI is [ 1.965019 , 2.466482 ] 
## with the length 0.5014623 for the sample with size 300 
## Fraction of means being in that interval:  0.9466667

## For confidence level 99 % CI is [ 1.904195 , 2.569503 ] 
## with the length 0.6653085 for the sample with size 300 
## Fraction of means being in that interval:  0.9833333

## For confidence level 90 % CI is [ 2.091235 , 2.320722 ] 
## with the length 0.2294872 for the sample with size 1000 
## Fraction of means being in that interval:  0.915

## For confidence level 95 % CI is [ 2.071613 , 2.345376 ] 
## with the length 0.2737627 for the sample with size 1000 
## Fraction of means being in that interval:  0.957

## For confidence level 99 % CI is [ 2.034306 , 2.395103 ] 
## with the length 0.360797 for the sample with size 1000 
## Fraction of means being in that interval:  0.994

## For confidence level 90 % CI is [ 2.164503 , 2.236899 ] 
## with the length 0.07239662 for the sample with size 10000 
## Fraction of means being in that interval:  0.9

## For confidence level 95 % CI is [ 2.157813 , 2.244089 ] 
## with the length 0.0862757 for the sample with size 10000 
## Fraction of means being in that interval:  0.9477

## For confidence level 99 % CI is [ 2.144858 , 2.258275 ] 
## with the length 0.1134172 for the sample with size 10000 
## Fraction of means being in that interval:  0.9894

Student’s t-distribution

Using formula from the lecture:
\(\hat{X} - \frac{S}{\sqrt{n}} \cdot t_{1-\frac{\alpha}{2}}^{n-1} \le \theta \le \hat{X} + \frac{S}{\sqrt{n}} \cdot t_{1-\frac{\alpha}{2}}^{n-1}\)

Here \(S\) stands for the best estimation of the standard deviation one can get (since one has no information about the whole population) \(S^2 = \frac{1}{n-1} \sum_{i} {(\hat{X} - X_i)}^2\)

for (sample in all_exp_samples){

  n <- sqrt(length(sample))
  exp_cols_means <- colSums(sample) / n
  exp_mean <- mean(exp_cols_means)
  exp_sd_means <- sd(exp_cols_means)

  for (alpha in alphas) {
    deviation <- exp_sd_means * qt(1 - alpha / 2, n - 1)
    left_border <- exp_mean - deviation
    right_border <- exp_mean + deviation
    get_stats(exp_cols_means, alpha, left_border, right_border)
  }

}

## For confidence level 90 % CI is [ 1.974627 , 2.400129 ] 
## with the length 0.4255028 for the sample with size 300 
## Fraction of means being in that interval:  0.92

## For confidence level 95 % CI is [ 1.933627 , 2.441129 ] 
## with the length 0.5075012 for the sample with size 300 
## Fraction of means being in that interval:  0.9466667

## For confidence level 99 % CI is [ 1.85311 , 2.521646 ] 
## with the length 0.6685359 for the sample with size 300 
## Fraction of means being in that interval:  0.99

## For confidence level 90 % CI is [ 2.087894 , 2.312127 ] 
## with the length 0.2242331 for the sample with size 1000 
## Fraction of means being in that interval:  0.901

## For confidence level 95 % CI is [ 2.066377 , 2.333643 ] 
## with the length 0.2672662 for the sample with size 1000 
## Fraction of means being in that interval:  0.956

## For confidence level 99 % CI is [ 2.024264 , 2.375757 ] 
## with the length 0.3514933 for the sample with size 1000 
## Fraction of means being in that interval:  0.995

## For confidence level 90 % CI is [ 2.163761 , 2.23645 ] 
## with the length 0.07268897 for the sample with size 10000 
## Fraction of means being in that interval:  0.9013

## For confidence level 95 % CI is [ 2.156797 , 2.243414 ] 
## with the length 0.08661671 for the sample with size 10000 
## Fraction of means being in that interval:  0.9495

## For confidence level 99 % CI is [ 2.143185 , 2.257026 ] 
## with the length 0.1138416 for the sample with size 10000 
## Fraction of means being in that interval:  0.9891
In all tested cases all approaches did great. But the most universal for general use is the Student’s t-distribution. The accuracy was very similar in all cases. So the question is more about what data one has to manipulate with it correctly and make correct predictions. And if sample is big enough, the precision is going to be great regardless of the chosen method.

Problem 4

Repeat parts (2)–(4) of Problem 3 (with corresponding amendments) for a Poisson distribution.

Normal approximation

Parameters of normal distribution: \(\mu = \theta\), \(\sigma^2 = s^2 / n\), where \(s^2 = \theta\) is the population variance (\(Var(X) = E(X) = {\lambda}\)). \(Z := \sqrt{n}(\bar{X} - \theta) / \sigma\) is approximately standard normal \(: N(0, 1)\).\(P(|\theta - \hat{X}| < z_{\beta}\cdot \sigma / \sqrt{n}) = P(|Z| \le z_{\beta}) = 2\beta - 1\), in other words,\(\theta\) is with probability \(2\beta - 1\) within \(X ± z_{\beta}\sqrt{\theta} \sqrt{n}\).

for (sample in all_croissant_samples){

  n <- sqrt(length(sample))
  croissant_cols_means <- colSums(sample) / n
  croissant_mean <- mean(croissant_cols_means)
  croissant_sd_sample <- sd(sample)

  for (alpha in alphas) {
    beta <- (2 - alpha) / 2
    deviation <- (qnorm(beta) * croissant_sd_sample / sqrt(n))
    left_border <- croissant_mean - deviation
    right_border <- croissant_mean + deviation
    get_stats(croissant_cols_means, alpha, left_border, right_border)
  }

}

## For confidence level 90 % CI is [ 2.051625 , 2.333975 ] 
## with the length 0.2823495 for the sample with size 300 
## Fraction of means being in that interval:  0.8966667

## For confidence level 95 % CI is [ 2.02458 , 2.36102 ] 
## with the length 0.3364402 for the sample with size 300 
## Fraction of means being in that interval:  0.9533333

## For confidence level 99 % CI is [ 1.971721 , 2.413879 ] 
## with the length 0.4421573 for the sample with size 300 
## Fraction of means being in that interval:  0.9966667

## For confidence level 90 % CI is [ 2.122485 , 2.276835 ] 
## with the length 0.1543499 for the sample with size 1000 
## Fraction of means being in that interval:  0.877

## For confidence level 95 % CI is [ 2.1077 , 2.29162 ] 
## with the length 0.1839192 for the sample with size 1000 
## Fraction of means being in that interval:  0.936

## For confidence level 99 % CI is [ 2.078805 , 2.320515 ] 
## with the length 0.2417109 for the sample with size 1000 
## Fraction of means being in that interval:  0.989

## For confidence level 90 % CI is [ 2.175724 , 2.224519 ] 
## with the length 0.0487948 for the sample with size 10000 
## Fraction of means being in that interval:  0.8951

## For confidence level 95 % CI is [ 2.17105 , 2.229193 ] 
## with the length 0.05814259 for the sample with size 10000 
## Fraction of means being in that interval:  0.9498

## For confidence level 99 % CI is [ 2.161916 , 2.238328 ] 
## with the length 0.07641232 for the sample with size 10000 
## Fraction of means being in that interval:  0.9908

Normal approximation independent of parameter \(\theta\)

The confidence interval constructed above uses the unknown variance \(s^2 = θ^2\) and is of little use in practice. Instead, we can solve the double inequality for \(θ\) and get another confidence interval of confidence level \(2β −1\) that is independent of the unknown parameter.

\[ |\theta - \bar{X}| \le z_{\beta}\sigma/\sqrt{n} \]

\[ \theta = \sigma^2 => \sigma = \sqrt{\theta} \]

\[ -\frac{z_{\beta}\sqrt{\theta}}{\sqrt{n}}\le \theta - \bar{X} \le \frac{z_{\beta}\sqrt{\theta}}{\sqrt{n}} \]

\[ ...\\ ...\\... \]

\[ |\sqrt{\theta} - \frac{\sqrt{\frac{z^2_\beta}{n}} + 4\bar{X}}{2}| \le \frac{z_\beta}{2\sqrt{n}} \]

The rest is easy as all Probability and Statistics course in general.

for (sample in all_croissant_samples){

  n <- sqrt(length(sample))
  croissant_cols_means <- colSums(sample) / n
  croissant_mean <- mean(croissant_cols_means)

  for (alpha in alphas) {
    beta <- (2 - alpha) / 2
    z_val <- qnorm(beta)

    common <- sqrt(z_val^2 / n + 4 * croissant_mean ) / 2
    deviation <- 0.5 * z_val / sqrt(n)

    left_border <- (common - deviation) ^ 2
    right_border <- (common + deviation) ^ 2
    get_stats(croissant_cols_means, alpha, left_border, right_border)
  }

}

## For confidence level 90 % CI is [ 2.056611 , 2.338008 ] 
## with the length 0.2813969 for the sample with size 300 
## Fraction of means being in that interval:  0.9

## For confidence level 95 % CI is [ 2.031514 , 2.366891 ] 
## with the length 0.3353774 for the sample with size 300 
## Fraction of means being in that interval:  0.96

## For confidence level 99 % CI is [ 1.983361 , 2.424355 ] 
## with the length 0.4409941 for the sample with size 300 
## Fraction of means being in that interval:  0.9966667

## For confidence level 90 % CI is [ 2.123856 , 2.278169 ] 
## with the length 0.1543127 for the sample with size 1000 
## Fraction of means being in that interval:  0.879

## For confidence level 95 % CI is [ 2.109637 , 2.293524 ] 
## with the length 0.1838868 for the sample with size 1000 
## Fraction of means being in that interval:  0.938

## For confidence level 99 % CI is [ 2.082124 , 2.323831 ] 
## with the length 0.2417066 for the sample with size 1000 
## Fraction of means being in that interval:  0.986

## For confidence level 90 % CI is [ 2.175859 , 2.224655 ] 
## with the length 0.04879634 for the sample with size 10000 
## Fraction of means being in that interval:  0.8947

## For confidence level 95 % CI is [ 2.171241 , 2.229386 ] 
## with the length 0.05814481 for the sample with size 10000 
## Fraction of means being in that interval:  0.9493

## For confidence level 99 % CI is [ 2.162245 , 2.238662 ] 
## with the length 0.07641644 for the sample with size 10000 
## Fraction of means being in that interval:  0.9906

Student’s t-distribution

More universal approach to get rid of the dependence on \(θ\) in the previous task is to estimate \(s\) via the sample standard error and use approximation of \(\bar{X}\) via Student t-distribution. Using formula from the lecture:
\(\hat{X} - \frac{S}{\sqrt{n}} \cdot t_{1-\frac{\alpha}{2}}^{n-1} \le \theta \le \hat{X} + \frac{S}{\sqrt{n}} \cdot t_{1-\frac{\alpha}{2}}^{n-1}\)

Here \(S\) stands for the best estimation of the standard deviation one can get (since one has no information about the whole population) \(S^2 = \frac{1}{n-1} \sum_{i} {(\hat{X} - X_i)}^2\)

for (sample in all_croissant_samples){

  n <- sqrt(length(sample))
  croissant_cols_means <- colSums(sample) / n
  croissant_mean <- mean(croissant_cols_means)
  croissant_sd_means <- sd(croissant_cols_means)

  for (alpha in alphas) {
    deviation <- croissant_sd_means * qt(1 - alpha / 2, n - 1)
    left_border <- croissant_mean - deviation
    right_border <- croissant_mean + deviation
    get_stats(croissant_cols_means, alpha, left_border, right_border)
  }

}

## For confidence level 90 % CI is [ 2.052361 , 2.333239 ] 
## with the length 0.2808774 for the sample with size 300 
## Fraction of means being in that interval:  0.8933333

## For confidence level 95 % CI is [ 2.025297 , 2.360303 ] 
## with the length 0.3350051 for the sample with size 300 
## Fraction of means being in that interval:  0.9533333

## For confidence level 99 % CI is [ 1.972147 , 2.413453 ] 
## with the length 0.4413053 for the sample with size 300 
## Fraction of means being in that interval:  0.9966667

## For confidence level 90 % CI is [ 2.120215 , 2.279105 ] 
## with the length 0.1588894 for the sample with size 1000 
## Fraction of means being in that interval:  0.895

## For confidence level 95 % CI is [ 2.104969 , 2.294351 ] 
## with the length 0.1893822 for the sample with size 1000 
## Fraction of means being in that interval:  0.945

## For confidence level 99 % CI is [ 2.075128 , 2.324192 ] 
## with the length 0.2490647 for the sample with size 1000 
## Fraction of means being in that interval:  0.99

## For confidence level 90 % CI is [ 2.175641 , 2.224602 ] 
## with the length 0.04896141 for the sample with size 10000 
## Fraction of means being in that interval:  0.8966

## For confidence level 95 % CI is [ 2.17095 , 2.229293 ] 
## with the length 0.05834277 for the sample with size 10000 
## Fraction of means being in that interval:  0.9507

## For confidence level 99 % CI is [ 2.161781 , 2.238462 ] 
## with the length 0.07668075 for the sample with size 10000 
## Fraction of means being in that interval:  0.9911

For Poisson distribution the conclusions are pretty similar. All approaches performed good in all of the tests. However, the Student’s t-distribution is the most widely applicable for general use. In every instance, the accuracy was remarkably consistent. So the question is more about what data one has to work with in order to make accurate predictions. Furthermore, regardless of the method used, if the sample is large enough, the precision will be good enough.